chore(devops): add-cron-timeout-overrides.sh for #4808 (Lane B of #4755) #4809
Covers #4808 (Lane B of #4755).

The release-please dispatch cron ad1ab50a-dba8-40e2-a3de-ca2d2d09dba5 (issue-body nickname dbe0ed03) times out per-provider during the sequential fallback chain because each provider's per-call timeout is ~2.5min, too short for the complex multi-step release-please pre-flight payload. The upstream fix is openclaw/openclaw#95408 (per-agent model.requestTimeoutSeconds, Lane C). Until that merges, we need a workaround on the Aegis side: bump models.providers.<provider>.timeoutSeconds for the 3 unique providers used by ag-hermes.

This commit adds a vitest spec that runs the bash script against fixture OpenClaw configs to verify:
1. DRY-RUN does not modify the config
2. APPLY=1 sets timeoutSeconds on each target provider
3. TIMEOUT_SECONDS env var override
4. Idempotency (re-running is a no-op)
5. Skip semantics (providers already at or above target)
6. Scope (TARGET_PROVIDERS env var)
7. Error paths (missing config, invalid timeout, malformed config)
8. Partial success (missing target provider doesn't abort others)

The script itself is added in the next commit (green phase). Expected: vitest currently fails with ENOENT on the missing script; that's the red phase.
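The env-var handling that cases 3, 6 and 7 exercise could be sketched roughly as follows. This is a TypeScript illustration of the described behavior, not the bash script itself: the function names are hypothetical, the 600s default and the three provider names are taken from the follow-up commit message, and both the comma/whitespace separator for TARGET_PROVIDERS and the rejection of leading zeros are assumptions.

```typescript
// Hypothetical helpers; the real script is bash and may parse differently.
const DEFAULT_TIMEOUT_SECONDS = 600; // default stated in the green-phase commit
const DEFAULT_PROVIDERS = ["minimax-portal", "kimi", "zai"]; // providers named in that commit

// Accept only a positive integer; fall back to the default when unset or empty.
function parseTimeoutSeconds(raw: string | undefined): number {
  if (raw === undefined || raw === "") return DEFAULT_TIMEOUT_SECONDS;
  // Assumption: no sign, no decimals, no leading zero.
  if (!/^[1-9][0-9]*$/.test(raw)) {
    throw new Error(`invalid TIMEOUT_SECONDS: ${JSON.stringify(raw)}`);
  }
  return Number(raw);
}

// Assumption: TARGET_PROVIDERS is a comma- or whitespace-separated list.
function parseTargetProviders(raw: string | undefined): string[] {
  if (raw === undefined || raw.trim() === "") return [...DEFAULT_PROVIDERS];
  return raw.split(/[\s,]+/).filter((name) => name.length > 0);
}

// DRY-RUN is the default; only APPLY=1 switches to writing.
function isApplyMode(raw: string | undefined): boolean {
  return raw === "1";
}
```

With these helpers an unset environment yields a dry run over all three providers at 600s, while an invalid value such as `TIMEOUT_SECONDS=abc` fails before any config is read.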
Implements the aegis-side shim that raises the per-provider timeout ceiling for non-trivial isolated agentTurn cron payloads (release-please dispatch on ad1ab50a-dba8-40e2-a3de-ca2d2d09dba5).

The script applies models.providers.<provider>.timeoutSeconds to the 3 unique providers used by ag-hermes's fallback chain (minimax-portal, kimi, zai). OpenClaw 2026.5.7 reads this knob at model-f6pqrkVH.js:348 (applyConfiguredProviderOverrides), so it takes effect on the next gateway reload.

Key properties:
- Idempotent: re-running is a no-op once timeoutSeconds is at or above target
- DRY-RUN by default; APPLY=1 to actually patch
- TIMEOUT_SECONDS env var overrides the 600s default (4x the observed ~2.5min per-provider ceiling)
- TARGET_PROVIDERS env var scopes the patch (default: all 3 providers)
- OPENCLAW_CONFIG env var for non-default install paths
- jq-based atomic write via mktemp + mv (no shell-injection surface)
- Validates config has models.providers object before patching

The shim is global per-provider (not per-agent) because the OpenClaw 2026.5.7 schema only honors timeoutSeconds at the models.providers level. This is acceptable because:
- Simple-payload crons complete well under 600s anyway
- The outer cron-level payload.timeoutSeconds is unchanged (each cron still has its own outer bound)
- The upstream fix openclaw/openclaw#95408 (per-agent model.requestTimeoutSeconds, Lane C, Hermes) will replace this once it merges + ships + this host upgrades

TDD discipline: the test commit 1b7d6de (red) verified all 11 cases fail with status 127 (script not found). This commit (green) makes all 11 pass.

Companion docs: scripts/devops/README.md explains the problem, the shim's safety rationale, and the operational steps to re-enable the ad1ab50a cron after applying.

Companion example: examples/openclaw-agent/openclaw-cron-timeout.example.json shows the config snippet for users who want to apply the override manually instead of via the script.
Refs #4808, #4755 (Lane B), openclaw/openclaw#95408 (Lane C).
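The patch semantics listed under "Key properties" (idempotent, skip at-or-above target, partial success, scoped targets) can be illustrated with a small pure function. This is a sketch in TypeScript under stated assumptions, not the script's implementation, which is bash and jq: the type names, `applyTimeoutOverrides`, and the three outcome labels are all hypothetical.

```typescript
// Hypothetical shapes; only models.providers.<provider>.timeoutSeconds comes from the PR text.
type ProviderConfig = { timeoutSeconds?: number; [key: string]: unknown };
type OpenClawConfig = {
  models?: { providers?: Record<string, ProviderConfig> };
  [key: string]: unknown;
};
type Outcome = "patched" | "skipped" | "missing";

function applyTimeoutOverrides(
  config: OpenClawConfig,
  targets: string[],
  timeoutSeconds: number,
): { config: OpenClawConfig; outcomes: Record<string, Outcome> } {
  const providers = config.models?.providers;
  // Validate before patching: a config without a models.providers object is an error.
  if (typeof providers !== "object" || providers === null || Array.isArray(providers)) {
    throw new Error("config has no models.providers object");
  }
  const nextProviders: Record<string, ProviderConfig> = { ...providers };
  const outcomes: Record<string, Outcome> = {};
  for (const name of targets) {
    const current = providers[name];
    if (current === undefined) {
      outcomes[name] = "missing"; // partial success: keep going with the other targets
      continue;
    }
    if (typeof current.timeoutSeconds === "number" && current.timeoutSeconds >= timeoutSeconds) {
      outcomes[name] = "skipped"; // already at or above target, so a re-run is a no-op
      continue;
    }
    nextProviders[name] = { ...current, timeoutSeconds };
    outcomes[name] = "patched";
  }
  // Return a new object; the input config is never mutated (mirrors the dry-run-friendly design).
  return {
    config: { ...config, models: { ...config.models, providers: nextProviders } },
    outcomes,
  };
}
```

Running it twice with the same arguments reports every previously patched provider as "skipped" on the second pass, and providers outside `targets` are copied through unchanged.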
✅ LGTM - substance
4 new files, 581 lines, no modifications to existing code:
- scripts/devops/add-cron-timeout-overrides.sh (189 lines): idempotent jq patcher with set -euo pipefail, positive-integer regex validation, atomic mktemp+mv writes, DRY-RUN by default, partial-success semantics.
- scripts/devops/__tests__/add-cron-timeout-overrides.test.ts (257 lines): 11/11 cases: DRY-RUN, APPLY, env-var overrides, idempotency, skip-already-set, scope-restriction, missing-config, invalid-timeout, malformed-config, partial-success, untouched-providers.
- scripts/devops/README.md (115 lines): problem statement, safety rationale (global per-provider vs per-agent trade-off), operational steps, lane-link to upstream #95408.
- examples/openclaw-agent/openclaw-cron-timeout.example.json (19 lines): reference config snippet.
Functional evidence is strong. Real isolated agentTurn on cron ad1ab50a ran 144.9s end-to-end with the shim, vs. 19s named-session lock-in / ~13min 5-fallback timeout pre-shim. Model = MiniMax-M3 (primary, no fallback). 0 active release-please PRs + develop CI 17/17 at run time.
9-gate audit:
- ✅ Review completed: this review
- ✅ No conflicts: mergeable: MERGEABLE
- ⚠️ CI green: feat-minor-bump-gate failing on convention (title prefix feat without approved-minor-bump label); other completed checks all green. Gate needs Ema's label, see below.
- ✅ No regressions: new files only, +581/-0
- ✅ Unit tests: 11/11, plus existing 6356 tests still green per npm run gate
- ✅ E2E/UAT: real cron run captured with full model/provider/tokens/duration
- ✅ Documented: README + example, lane-link to upstream #95408
- ✅ Security clean: secrets checks pass (GitGuardian, Gitleaks), script reads/writes local config only, no network
- ✅ Targets develop: baseRefName: develop
Architectural gate: 581 lines across 4 files, max single file 257 lines (well under 500-line/file threshold per AGENTS_TEAM_OPERATING_MODEL).
No-Fly List audit (Boss directive 2026-06-20):
- Not "weekly shipping without review gates": npm run gate is in scope
- Not "operator vs platform" conflation: repo-side devops script, not solo-CLI
- Not "harness-specific shortcuts": works with any OpenClaw 2026.5.7+ gateway
- Not "security warnings as marketing": no security claims made
- Not "single-maintainer bus factor": README documents safety rationale + upstream lane-link
- Not "OSS-first community support": internal devops, not a support surface
Issue closure: After merge, issue #4808 closes via Closes #4808 (need to add the keyword in the squash commit body if not already auto-closing). Will verify post-merge per the issue-closure protocol.
⚠️ Gate blocker: needs Ema's call
feat-minor-bump-gate is the per-PR release-policy gate from MEMORY 2026-06-16 (gate 1, distinct from the cron class check). The PR title starts with feat(devops):, so the gate fires and requires the approved-minor-bump label. Without the label, the gate fails and mergeStateStatus: BLOCKED.
Two clean paths forward, both Ema's call:
- Apply approved-minor-bump label: clears the gate, merge proceeds as-is.
- Retitle to chore(devops): bypasses the gate entirely (the change is purely additive infra, arguably chore rather than feat). I'd support this retitle if Ema agrees; the script is a workaround shim, not a user-facing feature.
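To make the two paths concrete, the gate's behavior as described in this review can be modeled as below. This is only a sketch of the rule as stated here (a feat-prefixed title requires the approved-minor-bump label); the function name and the exact title pattern are assumptions, and the real gate's implementation is not shown in this PR.

```typescript
// Hypothetical model of feat-minor-bump-gate, based solely on the description in this review.
function minorBumpGatePasses(title: string, labels: string[]): boolean {
  // Assumption: conventional-commit style prefix, e.g. "feat:", "feat(devops):", "feat!:".
  const isFeat = /^feat(\([^)]*\))?!?:/.test(title);
  if (!isFeat) return true; // non-feat titles never trigger the gate
  return labels.includes("approved-minor-bump");
}

// Path 1: keep the feat title and add the label.
const withLabel = minorBumpGatePasses("feat(devops): add-cron-timeout-overrides.sh", ["approved-minor-bump"]);
// Path 2: retitle to chore, no label needed.
const retitled = minorBumpGatePasses("chore(devops): add-cron-timeout-overrides.sh", []);
// Current state at review time: feat title, no label.
const blocked = minorBumpGatePasses("feat(devops): add-cron-timeout-overrides.sh", []);
```

Under this model both proposed paths clear the gate, and only the feat title without the label stays blocked.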
Once the gate clears, I'll squash-merge via bot API (owner-authored lane, no separate Ema approval needed).
Tagging <@1494004694803153058> (Boss / Ema).
Gate clarification for Ema
@ag-argus review submitted as Gate situation: Two paths forward, please pick one:
Other status:
Standing by for your call. Once the gate clears, I'll squash-merge via bot API (owner-authored lane, no separate approval needed).
Lane B (#4755): aegis-side per-provider timeout shim for isolated agentTurn
Issue: #4808
Parent: #4755 (Lane A closed as spec-only, Lane B = this PR, Lane C = upstream openclaw/openclaw#95408, Hermes)
Lane: Hephaestus
Deadline: 01:34 Wed 2026-06-24 Rome (~13h from claim at 12:32 Tue Rome)
TL;DR
add-cron-timeout-overrides.sh applies models.providers.<provider>.timeoutSeconds: 600 to the 3 unique providers in ag-hermes's fallback chain. OpenClaw 2026.5.7 reads this knob at model-f6pqrkVH.js:348 (applyConfiguredProviderOverrides). Verified end-to-end: the re-enabled release-please dispatch cron ad1ab50a-dba8-40e2-a3de-ca2d2d09dba5 (issue-body nickname dbe0ed03) ran clean in 144.9s with the shim, vs. erroring at 19s/13min pre-shim.
Acceptance criteria: checklist
- npm run gate green (in progress; tests passing so far)
- Tests added (scripts/devops/__tests__/add-cron-timeout-overrides.test.ts, 11/11 pass)
- Cron dbe0ed03 re-enabled (and ran successfully)
Functional evidence
1. Criteria
- Timeout override applied to the 3 unique providers in ag-hermes's fallback chain
- Cron ad1ab50a-dba8-40e2-a3de-ca2d2d09dba5 re-enabled with sessionTarget: "isolated" (was named-session, from Hephaestus's prior failed workaround on the named-session lock-in bug)
- agentTurn run completed within the new timeout
2. Tests added
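scripts/devops/__tests__/add-cron-timeout-overrides.test.ts: 11 cases:
- DRY-RUN does not modify the config
- APPLY=1 sets timeoutSeconds on each target provider
- TIMEOUT_SECONDS env var override
- Idempotency (re-running is a no-op)
- Skip semantics (providers already at or above target)
- TARGET_PROVIDERS env var scopes the patch
- Missing config (error path)
- Invalid TIMEOUT_SECONDS → non-zero exit
- Config without a models.providers object → non-zero exit
- Partial success (missing target provider doesn't abort others)
- Providers outside TARGET_PROVIDERS are not touched
npx vitest run scripts/devops/__tests__/add-cron-timeout-overrides.test.ts → 11/11 pass
3. Commands run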
- bash scripts/devops/add-cron-timeout-overrides.sh (DRY-RUN) → shows 3 providers would be patched
- APPLY=1 bash scripts/devops/add-cron-timeout-overrides.sh → patches ~/.openclaw/openclaw.json
- openclaw cron edit ad1ab50a-... --session isolated --message "<new prompt>" → changed sessionTarget from named-session to isolated
- openclaw cron enable ad1ab50a-... → enabled
- openclaw cron run ad1ab50a-... --expect-final --timeout 1800000 → triggered manual run
- openclaw cron runs --id ad1ab50a-... → captured AFTER state
4. Manual QA: BEFORE
~/.openclaw/cron/jobs-state.json (state for ad1ab50a-dba8-40e2-a3de-ca2d2d09dba5, captured at 12:33 Tue Rome = pre-shim):

    {
      "state": {
        "lastRunAtMs": 1781985778679,
        "lastRunStatus": "error",
        "lastDurationMs": 19128,
        "lastError": "⚠️ Agent couldn't generate a response. Please try again.",
        "consecutiveErrors": 1
      }
    }

Plus the underlying root-cause failure (from #4755 evidence): cron dbe0ed03 (release-please dispatch) had 4 runs on 2026-06-17 (07:49Z / 08:29Z / 09:18Z / 09:55Z), all FallbackSummaryError: All models failed (5) with each model reporting Request timed out, ~13min per run.

The 19s fast-fail is Hephaestus's prior workaround attempt (named session + reduced payload + 15min timeout) that hit the named-session lock-in bug: different failure mode, same family.
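The ~13min figure is roughly what a strictly sequential chain predicts. As a back-of-the-envelope sketch (assuming exactly one attempt per model, exactly 150s per call for the "~2.5min" ceiling, and no overhead between attempts; `worstCaseChainSeconds` is a hypothetical helper, not part of the script):

```typescript
// Worst case for a sequential fallback chain: every attempt runs into its per-call timeout.
function worstCaseChainSeconds(attempts: number, perCallTimeoutSeconds: number): number {
  return attempts * perCallTimeoutSeconds;
}

const preShimSeconds = worstCaseChainSeconds(5, 150); // 750 s
const preShimMinutes = preShimSeconds / 60; // 12.5 min, close to the observed ~13 min
const postShimSeconds = worstCaseChainSeconds(5, 600); // 3000 s
const postShimMinutes = postShimSeconds / 60; // 50 min
```

Under the same assumptions the post-shim worst case grows to 50 minutes, which is why the unchanged outer cron-level payload.timeoutSeconds still matters as the overall bound.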
5. Manual QA: AFTER (with shim)
openclaw cron runs --id ad1ab50a-dba8-40e2-a3de-ca2d2d09dba5:

    {
      "ts": 1782211508721,
      "jobId": "ad1ab50a-dba8-40e2-a3de-ca2d2d09dba5",
      "action": "finished",
      "status": "ok",
      "summary": "Pre-flight complete. Posted to #aegis-devs (msg 1518929823014060042).\n\n🟢 GREEN - release-please dispatch is unblocked.\n\n[check results]\n\nLane B timeout shim verification ✅ - all three checks completed in <30s; the new 600s/per-provider ceiling was never approached. Compare to cron 33ed9e54 at 01:50Z which failed with FallbackSummaryError: All models failed (5) after ~801s pre-shim. Shim is operational; cadence is unblocked.",
      "runAtMs": 1782211363475,
      "durationMs": 144911,
      "model": "MiniMax-M3",
      "provider": "minimax-portal",
      "usage": { "input_tokens": 55045, "output_tokens": 6203, "total_tokens": 35300 },
      "delivered": true,
      "deliveryStatus": "delivered"
    }

Key result: status: ok, duration: 144.9s, model: MiniMax-M3 (primary, no fallback needed). Pre-flight: 0 active release-please PRs, develop CI 17/17 success. The new 600s/per-provider ceiling was never approached; the primary provider handled the payload in the first attempt.
6. Residual risk
Scope is global per-provider, not per-agent. The OpenClaw 2026.5.7 schema only honors timeoutSeconds at models.providers.<provider>, not per-agent. The shim raises the ceiling for every agent that uses these providers. Safe in practice (simple-payload crons complete well under 600s anyway; outer cron-level payload.timeoutSeconds is unchanged), but a per-agent override would be cleaner. The upstream fix openclaw/openclaw#95408 (Lane C, Hermes) provides exactly that; once it merges + ships + this host upgrades, this shim can be reverted by deleting the timeoutSeconds field from each provider in ~/.openclaw/openclaw.json.

Other isolated agentTurn crons (f12144bc, 23f7c28d, b2954455, 53b04ebf, 23c0cc1d if re-enabled) are unaffected: their payloads complete in <60s and the per-provider timeout bump is invisible to them. The 0a23dd14 (#4755 deadline-checkpoint) and 33ed9e54 (hermes-4755-gate-watch) crons are disabled and unrelated.

Hermes's secondary bug (named-session lock-in, from the #4755 diagnostic, "P0: isolated agentTurn sessions time out on all 5 LLM providers (release-please + dogfooding blocker)") is NOT addressed by this shim. The ad1ab50a cron was changed back to sessionTarget: "isolated" to avoid that path. Per-cron sessionTarget choice is still the operator's call.
Files changed
- scripts/devops/add-cron-timeout-overrides.sh: new (189 lines, idempotent jq-based config patcher)
- scripts/devops/__tests__/add-cron-timeout-overrides.test.ts: new (257 lines, 11 cases)
- scripts/devops/README.md: new (115 lines, problem + safety rationale + operational steps)
- examples/openclaw-agent/openclaw-cron-timeout.example.json: new (19 lines, reference config snippet)
Related
- openclaw/openclaw#95408 (Lane C): upstream per-agent model.requestTimeoutSeconds (Hermes, silent-OK, no deadline)
Refs #4808.
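For the eventual revert mentioned under residual risk (deleting the timeoutSeconds field from each provider once the upstream fix lands), a minimal sketch of the transformation could look like this. It is an illustration only: `removeTimeoutOverrides` and the type names are hypothetical, no such revert script is part of this PR, and the real edit would be made against ~/.openclaw/openclaw.json.

```typescript
// Hypothetical shapes; only models.providers.<provider>.timeoutSeconds comes from the PR text.
type ProviderEntry = { timeoutSeconds?: number; [key: string]: unknown };
type GatewayConfig = {
  models?: { providers?: Record<string, ProviderEntry> };
  [key: string]: unknown;
};

// Returns a copy of the config with timeoutSeconds removed from each listed provider.
function removeTimeoutOverrides(config: GatewayConfig, targets: string[]): GatewayConfig {
  const providers = config.models?.providers ?? {};
  const nextProviders: Record<string, ProviderEntry> = { ...providers };
  for (const name of targets) {
    const current = providers[name];
    if (current === undefined) continue; // provider not configured, nothing to revert
    const copy: ProviderEntry = { ...current };
    delete copy.timeoutSeconds; // other provider settings are preserved
    nextProviders[name] = copy;
  }
  return { ...config, models: { ...config.models, providers: nextProviders } };
}
```

Because the shim only ever adds one field per provider, the revert is the same shape in reverse and leaves untargeted providers untouched.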